 smoking status


RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records

Yang, Yang, Pollak, Kathryn I., Chakraborty, Bibhas, Liu, Molei, Zhou, Doudou, Hong, Chuan

arXiv.org Artificial Intelligence

Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most approaches rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.
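The abstract does not describe RELEAP's policy or reward in detail, so the following is only a minimal sketch of the general idea of choosing among several query strategies based on downstream feedback. It assumes an epsilon-greedy bandit, which is a swapped-in technique for illustration and not the authors' algorithm. The function name `run_bandit`, the `reward_fn` callback (standing in for the change in a downstream metric such as validation AUC after labeling a batch), and all numeric values are hypothetical.

```python
import random


def run_bandit(reward_fn, n_strategies, rounds, epsilon=0.1, seed=0):
    """Epsilon-greedy choice among query strategies (illustrative only).

    reward_fn(arm) is a hypothetical callback returning the downstream
    gain (e.g. change in validation AUC) after querying labels with
    strategy `arm`.
    """
    rng = random.Random(seed)
    values = [0.0] * n_strategies  # running mean reward per strategy
    counts = [0] * n_strategies    # number of times each strategy was used

    for t in range(rounds):
        if t < n_strategies:
            arm = t  # try every strategy once before comparing them
        elif rng.random() < epsilon:
            arm = rng.randrange(n_strategies)  # explore a random strategy
        else:
            # exploit the strategy with the best average gain so far
            arm = max(range(n_strategies), key=lambda i: values[i])

        reward = reward_fn(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    return values, counts


if __name__ == "__main__":
    # Toy feedback (made up): pretend strategy 1 gives the largest gain.
    toy_gain = {0: 0.000, 1: 0.020, 2: 0.005}
    values, counts = run_bandit(lambda arm: toy_gain[arm], 3, rounds=50)
    print(values, counts)
```

In this toy setup the selector concentrates its labeling budget on whichever strategy has produced the largest average downstream gain, which is the behavior the abstract contrasts with fixed heuristic rules.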


Integrating Text and Time-Series into (Large) Language Models to Predict Medical Outcomes

Larbi, Iyadh Ben Cheikh, Ravichandran, Ajay Madhavan, Burchardt, Aljoscha, Roller, Roland

arXiv.org Artificial Intelligence

Large language models (LLMs) excel at text generation, but their ability to handle clinical classification tasks involving structured data, such as time series, remains underexplored. In this work, we adapt instruction-tuned LLMs using DSPy-based prompt optimization to process clinical notes and structured EHR inputs jointly. Our results show that this approach achieves performance on par with specialized multimodal systems while requiring less complexity and offering greater adaptability across tasks.


Transformer-based Time-Series Biomarker Discovery for COPD Diagnosis

Gadgil, Soham, Galanter, Joshua, Negahdar, Mohammadreza

arXiv.org Artificial Intelligence

Chronic Obstructive Pulmonary Disease (COPD) is an irreversible and progressive disease that is highly heritable. Clinically, COPD is defined using summary measures derived from a spirometry test, but these are not always adequate. Here we show that the high-dimensional raw spirogram can provide a richer signal than the summary measures alone. We design a transformer-based deep learning technique to process the raw spirogram values along with demographic information and predict clinically relevant endpoints related to COPD. Our method performs better than prior work while being more computationally efficient. Using the weights learned by the model, we make the framework more interpretable by identifying parts of the spirogram that are important for the model predictions. Working with a board-certified pulmonologist, we also provide clinical insights into the different aspects of the spirogram and show that the explanations obtained from the model align with underlying medical knowledge.


Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status

Lee, Samuel, Wood-Doughty, Zach

arXiv.org Artificial Intelligence

Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier's mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this methodology by using a large language model trained on clinical notes to predict patients' smoking status, which would otherwise be an unobserved confounder. We then apply a measurement error correction on the categorical predicted smoking status to estimate the causal effect of transthoracic echocardiography on mortality in the MIMIC dataset.
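To make "correcting for the classifier's mismeasurement" concrete, here is a minimal sketch for the simplest binary case, using the standard Rogan-Gladen prevalence correction. It is not the paper's method, which handles a categorical smoking status and a causal effect estimate; the function name and the sensitivity and specificity values are assumptions chosen for illustration.

```python
def corrected_prevalence(observed_rate, sensitivity, specificity):
    """Rogan-Gladen correction for a binary classifier's predicted rate.

    observed_rate: fraction of patients the classifier labels positive.
    sensitivity, specificity: the classifier's (assumed known) error rates.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must perform better than chance")
    estimate = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


if __name__ == "__main__":
    # Made-up numbers: a true rate of 0.30 seen through a classifier with
    # sensitivity 0.9 and specificity 0.8 is observed as
    # 0.30 * 0.9 + 0.70 * 0.2 = 0.41.
    print(corrected_prevalence(0.41, sensitivity=0.9, specificity=0.8))
```

The correction inverts the relationship between the true and observed rates, which is why it needs a classifier that is better than chance: otherwise the denominator is zero or negative and the true rate cannot be recovered.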


Pulmonologists-Level lung cancer detection based on standard blood test results and smoking status using an explainable machine learning approach

Flyckt, Ricco Noel Hansen, Sjodsholm, Louise, Henriksen, Margrethe Høstgaard Bang, Brasen, Claus Lohman, Ebrahimi, Ali, Hilberg, Ole, Hansen, Torben Frøstrup, Wiil, Uffe Kock, Jensen, Lars Henrik, Peimankar, Abdolrahman

arXiv.org Artificial Intelligence

Lung cancer (LC) remains the primary cause of cancer-related mortality, largely due to late-stage diagnoses. Effective strategies for early detection are therefore of paramount importance. In recent years, machine learning (ML) has demonstrated considerable potential in healthcare by facilitating the detection of various diseases. In this retrospective development and validation study, we developed an ML model based on dynamic ensemble selection (DES) for LC detection. The model leverages standard blood sample analysis and smoking history data from a large population at risk in Denmark. The study includes all patients examined on suspicion of LC in the Region of Southern Denmark from 2009 to 2018. We validated and compared the predictions by the DES model with diagnoses provided by five pulmonologists. Among the 38,944 patients, 9,940 had complete data, of which 2,505 (25%) had LC. The DES model achieved an area under the ROC curve of 0.77 ± 0.01, sensitivity of 76.2% ± 2.4%, specificity of 63.8% ± 2.3%, positive predictive value of 41.6% ± 1.2%, and F1-score of 53.8% ± 1.1%. The DES model outperformed all five pulmonologists, achieving a sensitivity 9% higher than their average. The model identified smoking status, age, total calcium levels, neutrophil count, and lactate dehydrogenase as the most important factors for the detection of LC. The results highlight the successful application of the ML approach in detecting LC, surpassing pulmonologists' performance. Incorporating clinical and laboratory data in future risk assessment models can improve decision-making and facilitate timely referrals.
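As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of positive predictive value (precision) and sensitivity (recall). The sketch below plugs in the point estimates from the abstract; the function name is an assumption, and the check ignores the reported ± ranges.

```python
def f1_from_ppv_and_sensitivity(ppv, sensitivity):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


if __name__ == "__main__":
    # Point estimates quoted in the abstract above.
    f1 = f1_from_ppv_and_sensitivity(ppv=0.416, sensitivity=0.762)
    print(f"F1 = {f1:.3f}")  # about 0.538, matching the reported 53.8%

    # Share of complete-data patients with LC.
    print(f"LC share = {2505 / 9940:.3f}")  # about 0.252, i.e. roughly 25%
```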


An AI-enabled Bias-Free Respiratory Disease Diagnosis Model using Cough Audio: A Case Study for COVID-19

Saeed, Tabish, Ijaz, Aneeqa, Sadiq, Ismail, Qureshi, Haneya N., Rizwan, Ali, Imran, Ali

arXiv.org Artificial Intelligence

Cough-based diagnosis for Respiratory Diseases (RDs) using Artificial Intelligence (AI) has attracted considerable attention, yet many existing studies overlook confounding variables in their predictive models. These variables can distort the relationship between cough recordings (input data) and RD status (output variable), leading to biased associations and unrealistic model performance. To address this gap, we propose the Bias-Free Network (RBFNet), an end-to-end solution that effectively mitigates the impact of confounders in the training data distribution. RBFNet ensures accurate and unbiased RD diagnosis features, emphasizing its relevance by incorporating a COVID-19 dataset in this study. This approach aims to enhance the reliability of AI-based RD diagnosis models by navigating the challenges posed by confounding variables. A hybrid of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks is proposed for the feature encoder module of RBFNet. An additional bias predictor is incorporated in the classification scheme to formulate a conditional Generative Adversarial Network (cGAN), which helps decorrelate the impact of confounding variables from RD prediction. The merit of RBFNet is demonstrated by comparing its classification performance with that of a state-of-the-art (SoTA) Deep Learning (DL) model (CNN-LSTM) after training on different unbalanced COVID-19 data sets, created using a large-scale proprietary cough data set. RBFNet proved its robustness against extremely biased training scenarios by achieving test set accuracies of 84.1%, 84.6%, and 80.5% for the confounding variables gender, age, and smoking status, respectively. RBFNet outperforms the CNN-LSTM model's test set accuracies by 5.5%, 7.7%, and 8.2%, respectively.


Predicting Cardiovascular Disease Risk using Photoplethysmography and Deep Learning

Weng, Wei-Hung, Baur, Sebastien, Daswani, Mayank, Chen, Christina, Harrell, Lauren, Kakarmath, Sujay, Jabara, Mariam, Behsaz, Babak, McLean, Cory Y., Matias, Yossi, Corrado, Greg S., Shetty, Shravya, Prabhakara, Shruthi, Liu, Yun, Danaei, Goodarz, Ardila, Diego

arXiv.org Artificial Intelligence

Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention are critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. Here we investigated the potential to use photoplethysmography (PPG), a sensing technology available on most smartphones that can potentially enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compared the DLS with the office-based refit-WHO score, which adopts the shared predictors from the WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but is refitted on the UK Biobank (UKB) cohort. In the UKB cohort, the DLS's C-statistic (71.1%, 95% CI 69.9-72.4) was non-inferior to that of the office-based refit-WHO score (70.9%, 95% CI 69.7-72.2; non-inferiority margin of 2.5%, p<0.01). The calibration of the DLS was satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increased the C-statistic by 1.0% (95% CI 0.6-1.4). The DLS predicts ten-year MACE risk comparably to the office-based refit-WHO score. It provides a proof-of-concept and suggests the potential of PPG-based strategies for community-based primary prevention in resource-limited regions.


Smoking Accelerates Biological Age, Says AI

#artificialintelligence

In literature, characters that smoke are often described as haggard and older looking, with facial features that are associated with worn leather. While these depictions arguably carry over into reality, what is for certain is that the association between smoking, cancer, and cardiovascular disease is strong. Unfortunately, however, the connection between smoking and biological aging has been less clear. Yet, a new study from an international team of investigators led by scientists at Insilico Medicine may change how smoking is evaluated at the biochemical level. "In this study, we demonstrate for the first time that smoking status can be predicted using blood biochemistry and cell count results and the recent advances in artificial intelligence (AI)," the study authors explained.


Deep Learning Models Predict Cardiovascular Risk Factors from Images of the Eye

@machinelearnbot

The ability to stratify patients by cardiovascular risk is essential for identifying those likely to suffer a heart attack, stroke, or other heart disease in the future. High-risk patients can then take steps to improve their cardiovascular health. Doctors typically take into account a variety of risk factors: demographics such as age, sex and ethnicity; daily behaviors like exercise, smoking status and diet; as well as results from blood pressure and cholesterol tests. As a simple alternative to the traditional patient questionnaire and blood tests, a team of researchers from Google Research and the Stanford School of Medicine have developed deep learning models to predict cardiovascular risk factors from photographs of the back of the retina. Since these retinal fundus images are already collected for diabetic eye disease screening, this initial study suggests that deep learning could uncover additional information that could be further leveraged for preventative health.


AI trained to spot heart disease risks using retina scan

#artificialintelligence

The idea behind using a neural network for image recognition is that you don't have to tell it what to look for in an image. You don't even need to care about what it looks for. With enough training, the neural network should be able to pick out details that allow it to make accurate identifications. For things like figuring out whether there's a cat in an image, neural networks don't provide many, if any, advantages over the actual neurons in our visual system. But where they can potentially shine is in cases where we don't know what to look for.